Lesson Duration: 1 hour
You might have heard of the ‘R vs. Python’ debate for data science. Both are open-source languages that can be used to do data analysis.
In this course we will be using R. You will already have encountered the basics in the pre-course work.
The best way to understand the difference between R and Python is to know a bit about their history.
R was created by Ross Ihaka and Robert Gentleman in 1995 at the University of Auckland. Both authors of R worked as statisticians, and were designing R to help with statistical analysis. R is based on a language called S, which was also a statistical computing language.
Python was created by Guido van Rossum in 1991. Python is a general-purpose language that can be used for data analysis, but also for building websites, writing scripts and building applications. Python is a very popular language for learning to code.
R was designed for statistics. Python was designed primarily as a programming language.
Historically, R had a wider range of statistical analysis and visualisation tools.
Python was better designed as a programming language, making it easier to use.
However, the gap between the two languages is closing. Modern R libraries are very well designed and easy to use. The number of Python packages to do data analysis is growing all the time.
Currently, we think that the main libraries for working with data and visualising data are easier to use in R, so we will mainly be using R. However, Python is still an excellent data analysis language. We will be introducing you to Python at the end of the course, so you will be able to compare and contrast.
Both R and Python are open source languages. This means that they are free to use and that anyone can extend them.
Obviously being free to use is a big advantage. But the ability to expand is another huge advantage. Both R and Python have numerous external libraries that let you do things outside of the main language. We will be learning about several R libraries (often called R packages) in this course.
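As a rough sketch of how packages are used in practice, the code below checks that a package is available, loads it, and calls one of its functions. It uses `stats`, which ships with R, so it runs without an internet connection. The `ggplot2` name in the commented-out line is only an example of a package you might install from CRAN, not a step you need to take now.

```r
# Packages that ship with R (such as 'stats') are already installed.
# requireNamespace() returns TRUE when a package is available.
has_stats <- requireNamespace("stats", quietly = TRUE)
print(has_stats)

# A package from CRAN has to be installed once before it can be loaded.
# 'ggplot2' is only an example name; uncomment to run (needs internet).
# install.packages("ggplot2")

# library() loads an installed package for the current session.
library(stats)

# A function can also be called in the package::function form.
middle_value <- stats::median(c(1, 3, 10))
print(middle_value)
```

You install a package once, but you load it with `library()` in every new session where you want to use it.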
We are going to use an IDE (integrated development environment) called RStudio. While you don’t need it to write R code, RStudio makes many everyday tasks easier.
R is free and open source. It is written by volunteers, and all the packages you’ll use were also written by volunteers.
RStudio is also free and open source, but is made by a for-profit company. They make their money by selling a professional version of RStudio that runs on a server and comes with support.
First you will need to install the latest version of R, which is available here: https://cloud.r-project.org/
And then you install RStudio: https://www.rstudio.com/products/rstudio/download/
You want to select RStudio desktop open source edition, which is free.
Your RStudio should look like this
Top left: this is your R script. This is where you write the code you want to keep. Most of the time you will be typing here.
Top right: there are several tabs here. The most important one is the Environment tab. This shows you the objects you have created by writing code.
Bottom left: this is the console. This is where R code you write in the top left window gets run. You might also write code directly in here if you know you don’t want to keep it.
Bottom right: again, there are several tabs here, including Files, Plots, Packages and Help.
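To see how the console and the Environment tab relate, here is a small sketch you could type into the console. The object names `total` and `course_language` are made up for illustration.

```r
# Assigning a value with <- creates an object.
# In RStudio, new objects appear in the Environment tab.
total <- 10 + 10
course_language <- "R"

# ls() lists the names of the objects you have created so far.
print(ls())

# Typing an object's name (or printing it) shows its value.
print(total)
```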
Before we start using RStudio we’re going to set some options that will make working in RStudio easier.
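The first and most important option to change is under the Workspace heading of the global options (Tools → Global Options). Untick ‘Restore .RData into workspace at startup’ and set ‘Save workspace to .RData on exit’ to ‘Never’.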
This means that every time you start RStudio, only the code you have written will be saved. You won’t be saving the objects made and the libraries loaded.
This is important because it makes your code reproducible: if you send your code to someone else, they should be able to run it and get the same results as you.
Now for something more fun.
From the Appearance section of the global options (Tools → Global Options) you can change how RStudio looks. Play around with the colours, text size and font until you find a theme you enjoy. Remember you can change the theme at any time.
When working in RStudio you always want to work inside a project. A project is a special folder that lets RStudio know that all your code, data and other files are in the same place.
Projects also help with reproducibility. Everything you need to replicate your work should be inside the project folder.
Here you have two options:
New Directory: make a new folder on your computer, where a project will live.
Existing Directory: change a folder into a project folder.
Later, when you are working on new projects, you might already have a directory that contains the files you need. In that case you can use Existing Directory.
We’ll learn about R packages and Shiny Web Applications later.
Give the project a name. Something like ‘FirstProject’ will do for now.
Choose where to put the project.
Click ‘Create Project’.
You can tell when you are inside a project because the top left hand corner will show the name of the project.
If you want to have multiple R projects open at the same time then you can either:
At first your projects might only have one notebook and one dataset in them. However, later in the course you will be making big, complicated projects that use many notebooks, R scripts, datasets and produce a range of outputs including PDFs, graphics and new data.
You may want to introduce a folder structure that looks something like this:
notebooks
scripts
raw_data
clean_data
outputs
You would then keep all your notebooks in the notebooks folder, all your R scripts in the scripts folder, and so on.
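You can make these folders by hand, but as a minimal sketch they can also be created from R. The example below builds the structure inside a temporary folder so it is safe to run anywhere. The `project_root` location is an assumption for illustration; inside a real project you would point it at your project folder instead.

```r
# Hypothetical project location: a temporary folder, so nothing
# on your computer is changed permanently.
project_root <- file.path(tempdir(), "FirstProject")

# The folder names suggested above.
folders <- c("notebooks", "scripts", "raw_data", "clean_data", "outputs")

# Create each folder; recursive = TRUE also creates project_root itself.
for (folder in folders) {
  dir.create(file.path(project_root, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# List what was created.
print(list.files(project_root))
```

Building paths with `file.path()` rather than typing out full paths helps keep the code working when the project folder is moved to another computer.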
Now we are going to write some R code in a notebook.
Notebooks are files that have a mixture of code and text. A notebook file has the extension ‘.Rmd’. We will mostly be writing in notebook files throughout this course.
The other option is to write code in a script. You will be doing this later. Scripts are text files that contain only code, and they have the extension ‘.R’.
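To give a feel for what a script is, the sketch below writes a tiny two-line script to a temporary file and then runs it with `source()`. The file name `first_script.R` and the object name `result` are made up for this example; normally you would write the script in the RStudio editor rather than with `writeLines()`.

```r
# Hypothetical script file in a temporary folder.
script_path <- file.path(tempdir(), "first_script.R")

# A script is just a text file containing R code and comments.
writeLines(c(
  "# first_script.R: a script holds only code and comments",
  "result <- 3 + 3"
), script_path)

# source() runs every line of the script, in order.
source(script_path)

# The object made by the script is now available.
print(result)
```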
When you make a new notebook it’s very important to remember to save it.
Give your notebook a name like ‘FirstNotebook.Rmd’
You can also save by pressing cmd+s (ctrl+s on Windows).
The notebook you’ve created should tell you a little about notebooks.
We would strongly recommend that you get used to the keyboard shortcuts. You will be writing a lot of code in the next weeks - the shortcuts will save you a lot of time!
Read about how to make a new chunk.
10 + 10
## [1] 20
mtcars
Both the cars dataset and mtcars are inbuilt datasets. Most of the time we will be working with data from external sources that you will need to read in.
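As a small sketch of the difference, the code below first looks at the inbuilt `mtcars` dataset and then imitates an external source by writing it to a CSV file and reading it back in. The file name `mtcars_copy.csv` and the object name `cars_from_file` are assumptions for illustration; in the course you will be reading files that are given to you.

```r
# Look at the first three rows of an inbuilt dataset.
print(head(mtcars, 3))

# Imitate an external data source: save the data as a CSV file
# in a temporary folder (hypothetical file name).
csv_path <- file.path(tempdir(), "mtcars_copy.csv")
write.csv(mtcars, csv_path, row.names = FALSE)

# Read the file back in, as you would with data from outside R.
cars_from_file <- read.csv(csv_path)

# Check the number of rows and columns that were read.
print(dim(cars_from_file))
```

Because the file was written with `row.names = FALSE`, the car names are not stored in it, so `cars_from_file` has the same columns as `mtcars` but plain row numbers.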
You are going to be writing a whole bunch of code over the next weeks! To save time you should be using the keyboard as much as possible.
10 + 10
## [1] 20
3 + 3
## [1] 6
5 + 10
## [1] 15
cmd and arrows to skip to the end of a row
alt and arrows to skip in “chunks”
shift to highlight
For more RStudio-specific shortcuts see: Tools → Keyboard Shortcuts Help.
Another useful shortcut is being able to use multiple cursors. This is particularly useful when you are editing several similar lines at once.
alt + cmd and click to place another cursor
alt + cmd + shift to move the cursor
alt and drag to place multiple cursors in a line
And that’s your introduction to RStudio!
Remember to use a new project every time - it takes a little bit of getting used to, but soon you’ll have no problem.
Try to learn the keyboard shortcuts. They make life easier in the long run.